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SUPERSCALAR RISC INSTRUCTION SCHEDULING 



Inventors: Sanjiv Garg 

Kevin Ray ladonato 
Le Trong Nguyen 
Johannes Wang 

CROSS-REFERENCE TO RELATED APPLICATIONS 

The following are commonly owned, co-pending applications: 

• "Semiconductor Floor Plan and Method for a Register Renaming 

Circuit", Serial No. , concurrently filed with the 

present application (Attorney Docket No. SP041/1397.0190000); 

• "High Performance RISC Microprocessor Architecture", Serial No. 
07/817,810, filed 1/8/92 (Attorney Docket No. SP015/ 
1397.0280001); 

• "Extensible RISC Microprocessor Architecture", Serial No. 
07/817,809, filed 1/8/92 (Attorney Docket No. SP021/1397.0300001). 

The disclosures of the above applications are incorporated herein by 
reference. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to superscalar reduced instruction set 
computers (RISC), more particularly, the present invention relates to 
instruction scheduling including register renaming and instruction issuing 
for superscalar RISC computers. 
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2. Related Art 

A more detailed description of some of the basic concepts 
discussed in this application is found in a number of references, including 
Mike Johnson, Superscalar Microprocessor Design (Prentice-Hall, Inc., 

5 Bnglewood Cliffs, New Jersey, 1991); John L. Hennessy et al., Computer 
Architecture - A Quantitative Approach" (Morgan Kaufmann Publishers, 
Inc., San Mateo, California, 1990). Johnson's text, particularly Chapters 
2, 6 and 7 provide an excellent discussion of the register renaming issues 
addressed by the present invention. 

10 A major consideration in a superscalar RISC processor is to how 

to execute multiple instructions in parallel and out-of-order, without 
mcurring data errors due to dependencies inherent in such execution. 
Data dependency checking, register renaming and instruction scheduling 
are integral aspects of the solution. 

15 2.1 Storage Conflicts and Register Renaming 

True dependencies (sometimes called "flow dependencies" or 
"write-read" dependencies) are often grouped with anti-dependencies (also 
called "read-write 1 ' dependencies) and output dependencies (also called 
Vrite-write" dependencies) into a single group of instruction 

20 dependencies. The reason for this grouping is that each of these 
dependencies manifests itself through use of registers or other storage 
locations. However, it is important to distinguish true dependencies from 
the other two. True dependencies represent the flow of data and 
information through a program. Anti- and output dependencies arise 

25 because, at different points in time, registers or other storage locations 
hold different values for different computations. 

When instructions are issued in order and complete in order, there 
is a one-to-one correspondence between registers and values. At any 
given point in execution, a register identifier precisely identifies the value 
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contained in the corresponding register. When instructions are issued out 
of order and complete out of order, correspondence between registers and 
values breaks down, and values conflict for registers. This problem is 
severe when the goal of register allocation is to keep as many values in as 
5 few registers as possible. Keeping a large number of values in a small 
number of registers creates a _large number .of conflicts when the 
execution order is changed from the order assumed by the register 
allocator. 

And- and output dependencies are more properly called "storage 
10 conflicts" because reusing storage locations (including registers) causes 
instructions to interfere with one another even though conflicting 
instructions are otherwise independent Storage conflicts constrain 
instruction issue and reduce performance. But storage conflicts, like other 
resource conflicts, can be reduced or eliminated by duplicating the 
15 troublesome resource. 

2.2 Dependency Mechanisms 

Johnson also discusses in detail various dependency mechanisms, 
including: software, register renaming, register renaming with a reorder 
buffer, register renaming with a future buffer, interlocks, the copying of 
20 operands in the instruction window to avoid dependencies, and partial 
renaming. 

A conventional hardware implementation relies on software to 
enforce dependencies between instructions. A compiler or other code 
generator can arrange the order of instructions so that the hardware 

25 cannot possibly see an instruction until it is free of true dependencies and 
storage conflicts. Unfortunately, this approach runs into several problems. 
Software does not always know the latency of processor operations, and 
thus, cannot always know how to arrange instructions to avoid 
dependencies. There is the question of how the software prevents the 

30 hardware from seeing an instruction until it is free of dependencies. In 
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a scalar processor with low operation latencies, software can insert 
"no-ops" in the code to satisfy data dependencies without too much 
overhead. If the processor is attempting to fetch several instructions per 
cycle, or if some operations take several cycles to complete, the number 
5 of no-ops required to prevent the processor from seeing dependent 
instructions rapidly becomes excessive, causing an unacceptable increase 
in code size. The no-ops use a precious resource, the instruction cache, 
to encode dependencies between instructions. 

When a processor permits out-of-order issue, it is not at all clear 

10 what mechanism software should use to enforce dependencies. Software 
has little control over the behavior of the processor, so it is hard to see 
how software prevents the processor from decoding dependent 
instructions. The second consideration is that no existing binary code for 
any scalar processor enforces the dependencies in a superscalar processor, 

15 because the mode of execution is very different in the superscalar 
processor. Refying on software to enforce dependencies requires that the 
code be regenerated for the superscalar processor. Finally, the 
dependencies in the code are directly determined by the latencies in the 
hardware, so that the best code for each version of a superscalar processor 

20 depends on the implementation of that version. 

On the other hand, there is some motivation against hardware 
dependency techniques, because they are inherently complex. Assuming 
instructions with two input operands and one output value, as holds for 
typical RISC instructions, then there are five possible dependencies 

25 between any two instructions: two true dependencies, two 
anti-dependencies, and one output dependency. Furthermore, the number 
of dependencies between a group of instructions, such as a group of 
instructions in a window, varies with the square of the number of 
instructions in the group, because each instruction must be considered 

30 against every other instruction. 

Complexity is further multiplied by the number of instructions that 
the processor attempts to decode, issue, and complete in a single cycle. 

SP035/1397.0170000 



10 



20 



These actions introduce dependencies. The only aid in reducing 
complexity is that the dependencies can be determined incrementally, over 
many cycles to help reduce the scope and complexity of the dependency 
hardware. 

One technique for removing storage conflicts is by providing 
additional registers that are used to reestablish the correspondence 
between registers and values. The additional registers are conventionally 
allocated dynamically by hardware, and the registers are associated with 
values needed by the program using "register renaming." To implement 
register renaming, processors typically allocate a new register for every 
new value produced (i.e., for every instruction that writes a register). An 
instruction identifying the original register, for the purpose of reading its 
value, obtains instead the value in the newly allocated register. Thus, 
hardware renames the original register identifier in the instruction to 
identify the new register and correct value. The same register identifier 
in several different instructions may access different hardware registers, 
depending on the locations of register references with respect to register 
assignments. 

Consider the following code sequence where "op" is an operation, 
"Rn" represents a numbered register, and ":=" represents assignment: 



R3b := R3a op R5a (1) 

R4b := R3b + 1 (2) 

R3c := R5a +1 (3) 

R7b := R3c op R4b (4) 
25 Each assignment to a register creates a new "instance" of the 

register, denoted by an alphabetic subscript The creation of a new 
instance for R3 in the third instruction avoids the anti- and output 
dependencies on the second and first instructions, respectively, and yet 
does not interfere with correctly supplying an operand to the fourth 
30 instruction. The assignment to R3 in the third instruction supersedes the 
assignment to R3 in the first instruction, causing R3c to become the new 
R3 seen by subsequent instructions until another instruction assigns a 
value to R3. 
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Hardware that performs renaming creates each new register 
instance and destroys the instance when its value is superseded and there 
are no outstanding references to the value. This removes anti- and output 
dependencies and allows more instruction parallelism. Registers are still 
5 reused, but reuse is in line with the requirements of parallel execution. 
This is particularly helpful with out-of-order issue, because storage 
conflicts introduce instruction issue constraints that are not really 
necessary to produce correct results. For example, in the preceding 
instruction sequence, renaming allows the third instruction to be issued 

10 immediately, whereas, without renaming, the instruction must be delayed 
until the first instruction is complete and the second instruction is issued. 

Another technique for reducing dependencies is to associate a 
single bit (called a "scoreboard bit") with each register. The scoreboard 
bit is used to indicate that a register has a pending update. When an 

15 instruction is decoded that will write a register, the processor sets the 
associated scoreboard bit The scoreboard bit is reset when the write 
actually occurs. Because there is only one scoreboard bit indicating 
whether or not there is a pending update, there can be only one such 
update for each register. The scoreboard stalls instruction decoding if a 

20 decoded instruction will update a register that already has a pending 
update (indicated by the scoreboard bit being set). This avoids output 
dependencies by allowing only one pending update to a register at any 
given time. 

Register renaming, in contrast, uses multiple-bit tags to identity the 
25 various uncomputed values, some of which values may be destined for the 
same processor register (that is, the same program-visible register). 
Conventional renaming requires hardware to allocate tags from a pool of 
available tags that are not currently associated with any value and requires 
hardware to free the tags to the pool once the values have been 
30 computed. Furthermore, since scoreboarding allows only one pending 
update to a given register, the processor is not concerned about which 
update is the most recent 
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A further technique for reducing dependencies is using register 
renaming with a "reorder buffer" which uses associative lookup. Hie 
associative lookup maps the register identifier to the reorder buffer entry 
as soon as the entry is allocated, and, to avoid output dependencies, the 
5 lookup is prioritized so that only the value for the most recent assignment 
is obtained if the register is assigned more than once. A tag is obtained 
if the result is not yet available. Hiere can be as many instances of a 
given register as there are reorder buffer entries, so there are no storage 
conflicts between instructions. The values for the different instances are 

10 written from the reorder buffer to the register file in sequential order. 
When the value for the final instance is written to the register file, the 
reorder buffer no longer maps the register; the register file contains the 
only instance of the register, and this is the most recent instance. 

However, renaming with a reorder buffer relies on the associative 

15 lookup in the reorder buffer to map register identifiers to values. In the 
reorder buffer, the associative lookup is prioritized so that the reorder 
buffer always provides the most recent value in the register of interest (or 
a tag). The reorder buffer also writes values to the register file in order, 
so that, if the value is not in the reorder buffer, the register file must 

20 contain the most recent value. 

In a still further technique for reducing dependencies, associative 
lookup can be eliminated using a "future file." The future file does not 
have the properties of the reorder buffer discussed in the preceding 
paragraph. A value presented to the future file to be written may not be 

25 the most recent value destined for the corresponding register, and the 
value cannot be treated as the most recent value unless it actually is. The 
future file therefore keeps track of the most recent update and checks 
that each write corresponds to the most recent update before it actually 
performs the write. 

30 When an instruction is decoded, it accesses tags in the future file 

along with the operand values. If the register has one or more pending 
updates, the tag identifies the update value required by the decoded 
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instruction. Once an instruction is decoded, other instructions may 
overwrite this instructions's source operands without being constrained by 
anti-dependencies, because the operands are copied into the instruction 
window. Output dependencies are handled by preventing the writing as 

5 a result into the future file if the result does not have a tag for the most 
recent value. Both anti- and output dependencies are handled without 
stalling instruction issue. 

If dependencies are not removed through renaming, "interlocks" 
must use to enforce dependencies. An interlock simply delays the 

LO execution of an instruction until the instruction is free of dependencies. 
There are two ways to prevent an instruction from being executed: one 
way is to prevent the instruction from being decoded, and the other is to 
prevent the instruction from being issued. 

To improve performance over scoreboarding, interlocks are moved 

15 from the decoder to the instruction window using a "dispatch stack." The 
dispatch stack is an instruction window that augments each instruction in 
the window with dependency counts. There is a dependency count 
associated with the source register of each instruction in the window, 
giving the number of pending prior updates to the source register and 

tO thus the number of updates that must be completed before all possible 
true dependencies are removed. There are two similar dependency counts 
associated with the destination register of each instruction in the window, 
giving both the number of pending prior uses of the register (which is the 
number of anti-dependencies) and the number of pending prior updates 

15 to the register (which is the number of output dependencies). 

When an instruction is decoded and loaded into the dispatch stack, 
the dependency counts are set by comparing the instruction's register 
identifiers with the register identifiers of all instructions already in the 
dispatch stack. As instructions complete, the dependency counts of 

10 instructions that are still in the window are decremented based on the 
source and destination register identifiers of completing instructions (the 
counts are decremented by a variable amount, depending on the number 
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of instructions completed). An instruction is independent when all of its 
counts are zero. The use of counts avoids having to compare all 
instructions in the dispatch stack to all other instructions on every cycle. 
Anti-dependencies can be avoided altogether by copying operands 

5 to the instruction window (for example, to the reservation stations) during 
instruction decode. In this manner, the operands cannot be overwritten 
by subsequent register updates. Operands can be copied to eliminate 
anti-dependencies in any approach, independent of register renaming. 
The alternative to copying operands is to interlock anti-dependencies, but 

10 the comparators and/or counters required for these interlocks are costly, 
considering the number of combinations of source and result registers to 
be compared. 

A tag can be supplied for the operand rather than the operand 
itself. This tag is simply a means for the hardware to identify which value 

15 the instruction requires, so that, when the operand value is produced, it 
can be matched to the instruction. If there can be only one pending 
update to a register, the register identifier can serve as a tag (as with 
scoreboarding). If there can be more than one pending update to a 
register (as with renaming), there must be a mechanism for allocating 

20 result tags and insuring uniqueness. 

An alternative to scoreboarding interlocking is to allow multiple 
pending updates of registers to avoid stalling the decoder for output 
dependencies, but to handle anti-dependencies by copying operands (or 
tags) during decode. An instruction in the window is not issued until it 

25 is free of output dependencies, so the updates to each register are 
performed in the same order in which they would be performed with in- 
order completion, except that updates for different registers are out of 
order with respect to each other. This alternative has almost all of the 
capabilities of register renaming, lacking only the capability to issue 

30 instructions so that updates to the same register occur out of order. 

There appears to be no better alternative to renaming other than 
with a reorder buffer. Underlying the discussion of dependencies has 
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been the assumption that the processor performs out-of-order issue and 
already has a reorder buffer for recovering from mispredicted branches. 
Out-of-order issue makes it unacceptable to stall the decoder for 
dependencies. If the processor has an instruction window, it is 
inconsistent to limit the look ahead capability of the processor by 
interlocking the decoder. There are then only two alternatives: implement 
anti- and output dependency interlocks in the window or remove these 
altogether with renaming. 



SUMMARY OF THE INVENTION 

The present invention is directed to instruction scheduling 
including register renaming and instruction issuing for superscalar RISC 
computers. A Register Rename Circuit (RRC), which is part of the 
scheduling logic allows a computer's Instruction Execution Unit (DEU) to 
execute several instructions at the same time while avoiding dependencies. 
In contrast to conventional register renaming, the present invention does 
not actually rename register addresses. The RRC of the present invention 
temporarily buffers the instruction results, and the results of out-of-order 
instruction execution are not transferred to the register file until all 
previous instructions are done. The RRC also performs result forwarding 
to provide temporarily buffered operands (results) to dependant 
instructions. The RRC contains three subsections: a Data Dependency 
Checker (DDC), Tag Assign Logic (TAL) and Register file Port MUXes 
(RPM). 

The function of the DDC is to locate the dependencies between 
the instructions for. a group of instructions. The DDC does this by 
comparing the addresses of the source registers of each instruction to the 
addresses of the destination registers of each previous instruction in the 
group. For example, if instruction A reads a value from a register that is 
written to by instruction B, then instruction A is dependent upon 
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instruction B and instruction A cannot start until instruction B has 
finished. The DDC outputs indicate these dependencies. 

The outputs of the DDC go to the TAL. Because it is possible for 
an instruction to be dependent on more than one previous instruction, the 

5 TAL must determine which of those previous instructions will be the last 
one to be executed. The present invention automatically maps each 
instruction a predetermined temporary buffer location; hence, the present 
invention does not need prioritized associative look-up as used by 
convention reorder buffers, thereby saving chip area/cost and execution 

10 speed. 

Out-of-order results for several instructions being executed at the 
same time are stored in a set of temporary buffers, rather that the file 
register designated by the instruction. If the DDC determines, for 
example, that a register that instruction 6's source is written to by 

15 instructions 2, 3 and 5, then the TAL will indicate that instruction 6 must 
wait for instruction 5 by outputting the "tag" of instruction 5 for 
instruction 6. The tag of instruction 5 shows the temporary buffer 
location where instruction 5's result is stored. It also contains a one bit 
signal (called a "done flag") that indicates if instruction 5 is finished or 

20 not The TAL will output three tags for each instruction, because each 
instruction can have three source registers. If an instruction is not 
dependent on any previous instruction, the TAL will output the register 
file address of the instruction's input, rather an a temporary buffer's 
address. 

25 The last part of the RRC are the RPMs or Register file Port 

MUXes. The inputs of the RPMs are the outputs of the TAL, and the 
select lines for the RPMs come from another part of the DBU called the 
Instruction Scheduler or Issuer. The Instruction Scheduler chooses which 
instruction to execute (this decision is based partly on the done flags) and 

30 then uses the RPMs to select the tags of that instruction. These tags go 
to the read address ports of the computer's register files. In the previous 
example, once instruction 5 has finished, the Instruction Scheduler will 
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start instruction 6. It will select the RPM so that the address of 
instruction 5's result (its tag) is sent to the register file, and the register 
file will make the result of instruction 5 available to instruction 6. 

Hie foregoing and other features and advantages of the present 
5 invention will be apparent from the following more particular description 
of the preferred embodiments of the invention, as illustrated in . the 
accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention will be better understood if reference is made to the 
10 accompanying drawings. 

FIG. 1 shows a representative high level block diagram of the 
register renaming circuit of the present invention. 

FIG. 2 shows a representative block diagram of the data 
dependency check circuit of the present invention. 
15 FIG. 3 shows a representative block diagram'of the tag assignment 

logic of the present invention. 

FIG. 4 shows a representative block diagram of the register port 
file multiplexers of the present invention. 

FIG. 5 is a representative flowchart showing a data dependency 
20 check method for IXS1 and IYS/D in accordance with the present 
invention. 

FIG's. 6A and 6B are representative flowcharts showing a tag 
assignment method in accordance with the present invention. 

FIG. 7 shows a representative block diagram which compares an 
25 instruction "Vs source/destination operand with each operand of an 
instruction X in accordance with an embodiment of the present invention. 

FIG. 8 shows a representative circuit diagram for comparator block 
706 of FIG. 7. 

FIG. 9 shows a representative block diagram of a Priority Encoder 
30 in accordance with an embodiment of the present invention. 
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FIG. 10 shows a representative block diagram of the instruction 
scheduling logic of the present invention. 

TWRTATTTO DESCRIPTION 

FIG. 1 shows a representative high level block diagram of an 

5 Instruction Execution Unit (IEU) 100 associated with the present 
invention. The goal of IEU 100 is to execute as many instructions as 
possible in the shortest amount of time. There are two basic ways to 
accomplish this: optimize IEU 100 so that each instruction takes as little 
time as possible or optimize IEU 100 so that it can execute several 

10 instructions at the same time. 

Instructions are sent to IEU 100 from an Instruction Fetch Unit 
(IFU, not shown) through an instruction FIFO (first-in-first-out register 
stack storage device) 101 in groups of four called "buckets." IEU 100 can 
decode and schedule up to two buckets of instructions at one time. FIFO 

15 101 stores 16 total instructions in four buckets labeled 0-3. IEU 100 looks 
at the an instruction window 102. In one embodiment of the present 
invention, window 102 comprises eight instructions (buckets 0 and 1). 
Every cycle IEU 100 tries to issue a maximum number of instructions 
from window 102. Window 102 functions as a instruction buffer register. 

20 Once the instructions in a bucket are executed and their results stored in 
the processor's register file (see block 117), the bucket is flushed out a 
bottom 104 and a new bucket is dropped in at a top 106. 

In order to execute instructions in parallel or out of order, care 
must be taken so that the data that each instruction needs is available 

25 when the instruction needs it and also so that the result of each 
instruction is available for any future instructions that might need it A 
Register Rename Circuit (RRC), which is part of the scheduling logic of 
the computer's IEU performs this function by locating dependencies 
between current instructions and then renaming the sources (inputs) of 

30 the instruction. 
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As noted above, there are three types of dependencies: input 
dependencies, oulput dependencies and anti-dependencies. Input 
dependencies occur when an instruction, call it A, that performs an 
operation on the result of a previous instruction, call it B. Output 
5 dependencies occur when the outputs of A and B are to be stored in the 
same place. Anti-dependencies occur when instruction A comes before 
B in the instruction stream and B*s result will be stored in the same place 
as one of A's inputs. 

Input dependencies are handled by not executing instructions until 

10 their inputs are available. RRC 112 is used to locate the input 
dependencies between current instructions and then to signal an 
Instruction Scheduler or Issuer 118 when all inputs for a particular 
instruction are ready. In order to locate these dependencies, RRC 112 
compares the register file addresses of each instruction's inputs with the 

15 addresses of each previous instruction's output using a data dependency 
circuit (DDC) 108. If one instruction's input comes from a register where 
a previous instruction's output will be stored, then the latter instruction 
must wait for the former to finish. 

This implementation of RRC 112 can check eight instructions at 

20 the same time, so a current instruction is defined as any one of those 
eight from window 102. It should become evident to those skilled in the 
art that the present invention can easily be adapted to check more or less 
instructions. 

In one embodiment of the present invention, instructions can have 
25 from 0 to 3 inputs and 0 or 1 outputs. Most instructions' inputs and 
outputs come from, or are stored in, one of several register files. Each 
register file 117 (e.g., separate integer, floating and boolean register files) 
has 32 real entries plus the group of 8 temporary buffers 116. When an 
instruction completes, (The term "complete" means that the operation is 
30 complete and the operand is ready to be written to its destination 
register.) its result is stored in its preassigned location in the temporary 
buffers 116. Its result is later moved to the appropriate place in register 
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file 117 after all previous instructions' results have been moved to their 
places in the register file. This movement of results from temporary 
buffers 116 to register file 117 is called "retirement* and is controlled by 
termination logic, as should become evident to those skilled in the art 
More than one instruction may be retired at a time. Retirement 
comprises updating the "official state" of the machine, including the 
computer's Program Counter, as will become evident to those skilled in 
the art For example, if instruction 10 happens to complete directly before 
instruction II, both results can be stored directly into register file 117. 
But if instruction 13 then completes, its result must be stored in temporary 
buffer 116 until instruction 12 completes. By having IEU 100 store each 
instruction's result in its preassigned place in the temporary buffers 116, 
IEU 100 can execute instructions out of program order and still avoid the 
problems caused by output and anti-dependencies. 

RRC 112 sends a bit map to an Instruction Scheduler 118 via a bus 
120 indicating which instructions in window 102 are ready for issuing. 
Instruction decode logic (not shown) indicates to Issuer 118 the resource 
requirements for each instruction over a bus 123. For each resource in 
IEU 100 (e.g., each functional unit being an adder, multiplier, shifter, or 
the like), Issuer 118 scans this information and selects the first and 
subsequent instructions for issuing by sending issue signals over bus 121. 
The issue signals select a group of Register File Port MUXes (RPMs) 124 
inside RRC 112 whose inputs are the addresses of each instruction's 
inputs. 

Because the results may stay in temporary buffer 116 several cycles 
before going to register file 117, a mechanism is provided to get results 
from temporary buffer 116 before they go to register file 117, so the 
information can be used as operands for other instructions. This 
mechanism is called "result forwarding," and without it, Issuer 118 would 
not be able to issue instructions out of order. This result forwarding is 
done in register file 117 and is controlled by RRC 112. The control 
signals necessary for performing the result forwarding will be come 
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evident to those skilled in the art, as should the random logic used for 
generating such control signals. 

If an instruction is not dependent on any of the current instructions 
result forwarding is not necessary since the instruction's inputs are already 
5 in register file 117. When Issuer 118 decides to execute that instruction, 
RRC 112 tells register file 117 to output its data. 

RRC 112 contains three subsections: a Data Dependency Checker 
(DDC) 108, Tag Assign Logic (TAL) 122 and Register File Port MUXes 
(RPM) 124. DDC 108 determines where the input dependencies are 

10 between the current instructions. TAL 122 monitors the dependencies for 
Issuer 118 and controls result forwarding. RPM 124 is controlled by 
Issuer 118 and directs the outputs of TAL 122 to the appropriate register 
file address ports 119. Instructions are passed to DDC 108 via bus 110. 
All source registers are compared with all previous destination registers 

15 for each instruction in window 102. 

Each instruction has only one destination, which may be a double 
register in one embodiment An instruction can only depend on a 
previous instruction and may have up to three source registers. There are 
various register file source and destination addresses that need to be 

20 checked against each other for any dependencies. As noted above, the 
eight bottom instructions corresponding to the lower two buckets are 
checked by DDC 108. All source register addresses are compared with 
all previous destination register addresses for the instructions in window 
102. 

25 For example, let's say a program has the following instruction 

sequence: 

add R0, Rl, R2 (0) 
add R0, R2, R3 (1) 
add R4, R5, R2 (2) 
30 add R2, R3, R4 (3) 

The first two registers in each instruction 0-3 are the source 
registers, and the last listed register in each instruction is the destination 
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register. For example, RO and Rl are the source registers for instruction 
0 and R2 is the destination register. Instruction 0 adds the contents of 
registers 0 and 1 and stores the result in R2. For instructions 1-3 in this 
example, the following are the comparisons needed to evaluate all of the 
dependencies: 

I1S1, 11S2 vs. IOD 
I2S1, 12S2 vs. I1D, IOD 
BS1, BS2 vs. I2D, I1D, IOD 

The key to the above is as follows: IXRS1 is the address of source 
(input) number 1 of instruction X; DCRS2 is the address of source (input) 
number 2 of instruction X; and DQD is the address of the destination 
(output) of instruction X. 

Note also that RRC 112 can ignore the fact that instruction 2 is 
output dependent on instruction 0, because the processor has a temporary 
buffer where instruction 2's result can be stored without interfering with 
instruction O's result As discussed before, instruction 2's result will not 
be moved from temporary buffers 116 to register file 117 until instructions 
0 and l's results are moved to register file 117. 

The number of instructions that can be checked by RRC 112 is 

easily scaleable. In order to check eight instructions at a time instead of 

four, the following additional comparisons would also need to be made: 

I4S1, 14S2 vs I3D, I2D, I1D, IOD 

I5S1, I5S2 vs I4D, BD, I2D, I1D, IOD 

I6S1, I6S2 vs I5D, I4D, BD, I2D, I1D, IOD 

I7S1, I7S2 vs I6D, I5D, I4D, BD, I2D, I1D, IOD 

There are several special cases that RRC 112 must handle in order 
to do the dependency check. First, there are some instructions that use 
the same register as an input and an output Thus, RRC 112 must 
compare this source/destination register address with the destination 
register addresses of all previous instructions. So for instruction 7, the 
following comparisons would be necessary: 

I7S1,I7S2,I7S/D vs. I6D,KD,I4D,BD,I2D,I1D,I0D. 
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Another special case occurs when a program contains instructions 
that generate 64 bit outputs (called long-word operations). These 
instructions need two registers in which to store their results. In this 
embodiment, these registers must be sequential. Thus if RRC 112 is 
checking instruction 4's dependencies and instruction 1 is a long-word 
operation, then it must do the following comparisons: 
I4S1.I4S2 vs. BD,I2D,I1D,I1D+1,I0D 

Sometimes, instructions do not have destination registers. Thus 
RRC 112 must ignore any dependencies between instructions without 
destination registers and any future instructions. Also, instructions may 
not have only one valid source register, so RRC 112 must ignore any 
dependencies between the unused source register (usually S2) and any 
previous instructions. 

RRC 112 is also capable of dealing with multiple register files. 
When using multiple register files, dependencies only occur when one 
instruction's source register has the same address and is in the same 
register file as some other instruction's destination register. RRC 112 
treats the information regarding which register file a particular address is 
from as part of the address. For example, in an implementation using 
four 32 bit register files, RRC 112 would do 7 bit compares instead of 5 
bit compares (5 for the address and 2 for the register file). 

Signals indicating which instructions are long-word operations or 
have invalid source or destination registers are sent to RRC 112 from 
Instruction Decode Logic (IDL; not shown). DDL also tells RRC 112 
which register file each instruction's sources and destinations will come 
from or go to. 

A block diagram of DDC 108 is shown in FIG. 2. Source address 
signals arrive from EFTFO 101 for all eight instructions of window 102. 
Additional inputs include long-word load operation flags, register file 
decode signals, invalid destination register flags, destination address 
signals and addressing mode flags for all eight instructions. 
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DDC 208 comprises 28 data dependency blocks 204. Each block 
204 is described in a KEY 206. Each block 204 receives 3 inputs, IXS1, 
IXS2 and DCS/D. IXS1 is the address of source (input) number 1 of 
instruction X, DCS2 is the address of source (input) number 2 of 
instruction X; and IXS/D is the address of the source/destination (input) 
of instruction X. Each block 204 also receives input IYS/D, which is the 
destination register address for some previous instruction Y. A top row 
208, for example, receives I0S/D, which is the destination register address 
for instruction 0. Each block 204 outputs the data dependency results to 
one of a corresponding bus line 114. For example, the address of I2S/D 
must be checked with operand addresses SI, S2 and S/D of instructions 
7, 6, 5, 4, and 3. 

Each block 204 performs the three comparisons. To illustrate 
these comparisons, consider a generic block 700 shown in FIG. 7, which 
compares instruction Y's source/destination operand with each operand 
of instruction X. In this example, the three following comparisons must 
be made: 

IXS1 = IYS/D 
IXS2 = IYS/D 
IXS/D = IYS/D 

These comparisons are represented by three comparator blocks 
702, 704 and 706, respectively. One set of inputs to comparator blocks 
702, 704 and 706 are the bits of the IYS/D field, which is represented by 
number 708. Comparator block 702 has as its second set of inputs the 
bits of the IXS1. Similarly, comparator block 704 has as its second set of 
inputs the bits of the DCS1, and comparator block 706 has as its second 
set of inputs the bits of the IXS/D. 

In a preferred embodiment, the comparisons performed by blocks 
702, 704 and 706 can be performed by random logic. An example of 
random logic for comparator block 706 is shown in FIG. 8. Instruction 
Y's source/destination bits [6:0] are shown input from the right at 
reference number 802 and instruction X's source/destination bits [6:0] are 
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shown input from the top at reference number 804. The most significant 
bit (MSB) is bit 6 and the least significant bit (LSB) is bit 0. The 
corresponding bits from the two operands are fed to a set of seven 
exclusive NOR gates (XNORs) 806. The outputs of XNORs 806 are then 
ANDed by a seven input AND gate 808. If the corresponding bits are the 
same, the output of XNOR 806 will be logic high. When all bits are the 
same, all seven XNOR 806 outputs are logic high and the omput of AND 
gate 808 is logic high, this indicates that there is a dependency between 
IXS/D and IYS/D. 

The random logic for comparator blocks 702 and 704 will be 
identical to that shown in FIG. 8. The present invention contemplates 
many other random logic circuits for performing data dependency 
checking, as will become evident to those skilled in the art without 
departing from the spirit of this example. 

As will further become evident to those skilled in the art, various 
implementation specific special cases can arise which require additional 
random logic to perform data dependency checking. An illustrative 
special data dependency checking case is for long word handling. 

As mentioned before, if a long word operation writes to register X, 
the first 32 bits are written to register X and the second 32 bits are 
written to register X+l. The data dependency checker therefore needs 
to check both resisters when doing a comparison. In a preferred 
embodiment, register X is an even register, X+l is an odd register and 
thus they only differ by the LSB. The easiest way to check both registers 
at the same time is to simply ignore the LSB. In the case of a store long 
(STLG) or load long (LDLG) operation, if X and Y only differ by the 
LSB bit [0], the logic in FIG. 8 would cause there to be no dependency, 
when there really is a dependency. Therefore, for a long word operation 
the STLG and LDLG flags must be ORed with the output of the [0] bit 
XNOR to assure that all dependencies are detected. 

A data dependency check flowchart for IXS1 and IYS/D is shown 
in FIG. 5. DDC 108 first checks whether IXS1 and IYS/D are in the 
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same register file, as shown at a conditional block 502. If they are not in 
the same register file there is no dependency. Hiis is shown at block a 
504. If there is a dependency, DDC 108 then determines whether DCS1 
and IYS/D are in the same register, as shown at a block 506. If they are 
5 not in the same register, flow proceeds to a conditional block 508 where 
DDC 108 determines whether IY is a long word operation. If IY is not 
a long word operation there is no dependency and flow proceeds to a 
block 504. If IY is a long word operation, flow then proceeds to a 
conditional statement 510 where DDC 108 determines whether IXS1 and 

10 IYS/D+1 are the same register. If they are not, there is no dependency 
and flow proceeds to a block 504. If IXS1 and IYS/D+1 are the same 
register, flow proceeds to a conditional block 512 where DDC 108 
determines if IY has a valid destination. If it does not have a valid 
destination, there is no dependency and flow proceeds to block 504. If IY 

15 does have a valid destination, flow proceeds to a conditional block 514 
where DDC 108 determines if IXS1 has a valid source register. Again, if 
no valid source register is detected there is no dependency, and flow 
proceeds to a block 504. If a valid source register is detected, DDC 108 
has determined that there is a dependency between DCS1 and IYX/D, as 

20 shown at a block 516. 

A more detailed discussion of data dependency checking is found 
in commonly owned, copending application Serial No. 
(Attorney Docket No. SP041/1397.0190000), the disclosure of which is 
incorporated herein by reference. 

25 Because it is possible that an instruction might get one of its inputs 

from a register that was written to by several other instructions, the 
present invention must choose which one is the real dependency. For 
example, if instructions 2 and 5 write to register 4 and instruction 7 reads 
register 4, then instruction 7 has two possible dependencies. In this case, 

30 it is assumed that since instruction 5 came after instruction 2 in the 
program, the programmer intended instruction 7 to use instruction 5's 
result and not instruction 2's. So, if an instruction can be dependent on 
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several previous instructions, RRC 112 will consider it to be dependent 
on the highest numbered previous instruction. 

Once TAL 122 has determined where the real dependencies are, 
it must locate the inputs for each instruction. In a preferred embodiment 

5 of the present invention, the inputs can come from the actual register file 
or an array temporary buffers 116. RRC 112 assumes that if an 
instruction has no dependencies, its inputs are all in the register file. In 
this case, RRC 112 passes the IXS1, DCS2 and IXS/D addresses that came 
from IFIFO 102 to the register file. If an instruction has a dependency, 

10 then RRC 112 assumes that the data is in temporary buffers 116. Since 
RRC 112 knows which previous instruction each instruction depends on, 
and since each instruction always writes to the same place in temporary 
buffers 116, RRC 112 can determine where in temporary buffers 116 an 
instruction's inputs are stored. It sends these addresses to register file 

15 read ports 119 and register file 117 outputs the data from temporary 

buffers 116 so that the instruction can use it 

The following is an example of tag assignments: 

0: add rO, rl, r2 
1: add rO, r2, r3 
20 2: add r4, r5, r2 

3: add r2, r3, r4 

The following are the dependencies for the above operations 
(dependencies are represented by the symbol "#"): 
I1S2#I0S/D 

25 KS1#I0S/D 
BS1#I2S/D 
I3S2#I1S/D 

First, look at 10; since it has no dependencies, its tags are equal to 

its original source register addresses: 

30 I0S1 TAG = I0S1 = rO 

I0S2 TAG = I0S2 = rl 
I0S/D TAG = I0S/D = r2 

II has one dependency, and its tags are as follows: 
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1151 TAG = I1S1 = iO 

1152 TAG = IOS/D = tO 

where: (tO = inst O's slot in temporary buffer) 
I1S/D TAG = I1S/D = r3 

5 12 is also independent 

1251 TAG = I2S1 = r4 

1252 TAG = I2S2 = r5 
I2S/D TAG = I2S/D = r2 

I3S1 has two possible dependencies, IOS/D and 12S/D. Because 
10 TAL 122 must pick the last one (highest numbered one), J2S/D is chosen. 

1351 TAG = I2S/D = t2 

1352 TAG = I1S/D = tl 
I3S/D TAG = I3S/D = r4 

These tags are then sent to RPM 124 via bus 126 to be selected by 
15 Issuer 118. At the same time TAL 122 is preparing the tags, it is also 
monitoring the outputs of DCL 130 and passing them on to Issuer 118 
using bus 120. TAL 122 chooses the proper outputs of DCL's 130 to 
pass to Issuer 118 by the same method that it chooses the tags that it 
sends to RPM 124. 

20 Continuing the example, TAL 122 sends the following ready signals 

to Issuer 118: 

I0S1 INFO = 1 

(Inst 0 is independent so it can start 
immediately) 
25 I0S2 INFO = 1 

IOS/D INFO = 1 

1151 INFO = 1 

1152 INFO= DONE[0] 

(DONE[0] = 1 when 10 is done) 
30 I1S/D INFO = 1 

1251 INFO = 1 

1252 INFO = 1 
I2S/D INFO = 1 

I3S1 INFO = DONE[2] 
35 I3S2 INFO = DONE[l] 

I3S/D READ = 1 
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(The DONE signals come from DCL 130 via a bus 132. In 
connection with the present invention, the term "done" means the result 
of the instruction is in a temporary buffer or otherwise available at the 
output of a functional unit Contrastingly, the term "terminate" means the 
5 result of the instruction is in the register file.) 

"Riming now to FIG. 3, a representative block diagram of TAL 122 
will be discussed. TAL 122 comprises 8 tag assignment logic blocks 302. 
Each TAL block 302 receives the corresponding data dependency results 
via buses 114, as well as further signals that come from the computer's 

10 Instruction Decode and control logic (not shown). The BKT bit signal 
forms the least significant bit of the tag. DONEjX] flags are for 
instructions 0 through 6, and indicate if instruction X is done. 
DBLREGfX] flags indicates which, if any, of the instructions is a double 
(long) word. Each TAL block 302 also receives its own instructions 

15 register addresses as inputs. The Misc. signals, DBLREG and BKT 
signals are all implementation dependent control . signals. Each TAL 
block 302 outputs 3 TAGs 126 labeled IXS1, DCS2 and IXS/D, which are 
6 bits. TAL 122 outputs the least significant 5 bits of each TAG signal to 
RPMs 124 and the most significant TAG to Issuer 118. 

20 Each block 302 of FIG. 3 comprises three Priority Encoders (PE), 

one for SI, one for S2 and one for S/D. There is one exception however. 
10 requires no tag assignment Its tags are the same as the original SI, S2 
and S/D addresses, because 10 is always independent 

An illustrative PE is shown in FIG. 9. PE 902 has eight inputs 904 

25 and eight outputs 906. Inputs 904 for PE 902 are outputs 114 from DDC 
108 which show where dependencies exist For example, in the case of 
source register 1 (SI), I7S1 tag assign PE 902's seven inputs are the seven 
outputs 114 of DDC 108 that indicate whether I7S1 is dependent on I6D, 
whether I7S1 is dependent on I5D, and so on down to whether I7S1 is 

30 dependent on I0D. An eighth input shown at reference number 908, is 
always tied high because there should always be an output from PE 902. 
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As stated before, if an instruction depends on several previous 
instructions, PE 902 will select and output only the most previous 
instruction (in program order) on which there is a dependency. This is 
accomplished by connecting the signal showing if there is a dependency 

- 5 on the most previous instruction to the highest priority input of the PE 
902 and the signal showing if there is a dependency on the second most 
previous instruction to the input of PE 902 with the second highest 
priority and so on for all previous instructions. The input of the PE 902 
with the lowest priority is always tied high so that at least one of PE 902*s 

10 outputs will be asserted. 

Outputs 906 are used as select lines for a MUX 910. MUX 910 
has eight inputs 912 to which the tags for each instruction are applied. 

To illustrate this, assume that 17 depends on 16 and 15, then, since 
16 has a higher priority than 15, the bit corresponding to 16 at outputs 906 

15 of PE 902 will be high. At the corresponding input 912 of MUX 910 will 
be I6*s tag for SI (recall PE 902 is for I7S1). Because 17 is dependent on 
16, the location of I6's result must be output from MUX 910 so that it can 
be used by 17. I6's tag will therefore be selected and output on an output 
line 914. I6's done flag, DONE[6] must also be output from MUX 910 

20 so that Issuer 118 will know when I7's input is ready. This data is passed 
to Issuer 118 via bus 120. Since an instruction can have up to three 
sources, TAL 122 monitors up to three dependencies for each instruction 
and sends three vectors for each instruction (totalling 24 vectors) to Issuer 
118. If an instruction is independent, TAL 122 signals to Issuer 118 that 

25 the instruction can begin immediately. 

The MSB of the tag outputs which are sent to RPMs 124 is used 
to indicate if the address is a register file address or a temporary buffer 
address. If an instruction is independent, then the five LSB outputs 
indicate the source register address. For instructions that have 

30 dependencies: the second MSB indicates that the address is for a 64 bit 
valve; the third through fifth MSB outputs specify the temporary buffer 
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address; and the LSB output indicates which bucket is the current bucket, 
which is equal to the BKT signal in TAL 122. 

Like DDC 108, TAL 122 has numerous implementation dependent, 
(i.e., special cases) that it handles. First, in an embodiment of the present 
5 invention, register number 0 of the register file is always equal to 0. 
Therefore, even if one instruction writes to register 0 and another reads 
from register 0, there will be no dependency between them. TAL 122 
receives three signals from Instruction Decode Logic (DDL; not shown) for 
each instruction to indicate if one of that instruction's sources is register 

10 0. If any of those is asserted, TAL 122 will ignore any dependencies for 
that particular input of that instruction. 

Another special case occurs because under some circumstances, an 
instruction in bucket 0 will be guaranteed to not have any- of the 
instructions in bucket 1 dependent on it A four bit signal called 

15 BKTl_NODEP_ is sent to RRC 112 from the ffiU control logic (not 
shown) and if BKTl_NODEP[X] = 1 then RRC 112 knows to ignore any 
dependencies between instructions, 4,5,6 or 7 and instruction X. 

An example for TAG assignment of instruction 7's source 1 (I7S1) 
is shown in a flowchart in FIG's. 6A-6B. TAL 122 first determines 

20 whether I7S1 is register 0, as shown at a conditional block 602. If the first 
source operand for 17 is register 0, the TAG is set equal to zero, and the 
I7Sl's INFO flag is set equal to one, as shown in a block 604. If the first 
source operand (SI) for 17 is not register 0, TAL 122 then determines if 
I7S1 is dependent on I6S/D, as shown at a conditional block 606. If I7S1 

25 is dependent on I6S/D flow then proceeds to a block 610 where I7Sl's 
TAG is set equal to {1,DBLREG[6],0,1,0,BKT} and I7Sl's INFO flag is 
set equal to DONE[6], as shown at a block 610. If either of the condition 
tested at a conditional block 606 is not met, flow proceeds to conditional 
block 612 where TAL 122 determines if I7S1 is dependent on I5S/D. If 

30 there is a dependency, flow then proceeds to block 616 where TAL 122 
sets I7Sl's TAG equal to {1,DBLREG[5],0,0,1,BKT} and I7Sl's INFO 
flag is set equal to DONE[5]. If the condition tested at block 612 is not 
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met, flow proceeds to a block 618 where TAL 122 determines if I7S1 is 
dependent on I4S/D. 

As evident by inspection of the remaining sections of FIG's. 6A 
and 6B, similar TAG determinations are made depending on whether 
I7S1 is dependent on I4S/D, BS/D, I2S/D, I1S/D and IOS/D, as shown at 
sections 620, 622, 624, 626 and 628, respectively. Finally, if instruction 7 
is independent of instruction 0 or if all instructions in bucket 1 are 
independent of instruction 0 (i.e., if BKTl_NODEP[d] = l),*s tested at 
a conditional block 630, the flow proceeds to block 632 where TAL 122 
sets I7Sl's TAG equal to {0J7S1} and I7Sl»s INFO flag equal to 1. It 
should be noted for the above example that I7S1 TAG signals are 
forwarded directly the register file port MUXes of register file 117. The 
I7S1 INFO signals are sent to Issuer 118 to tell it when I7's SI input is 
ready. 

A representative block diagram of Issuer 118 is shown in FIG. 10. 
In a preferred embodiment, Issuer 118 has one scanner block 1002 for 
each resource (functional unit) that has to be allocated. In this example, 
Issuer 118 has scanner blocks FU1, FU2, FU3, FU4 through FUn. 
Requests for functional units are generated from instruction information 
by decoding logic (not shown) in a known manner, which are sent to 
scanners 1002 via bus 123. Each scanner block 1002 scans from 
instruction 10 to 17 and selects the first request for the corresponding 
functional unit to be serviced during that cycle. 

In the case of multiple register files (integer, floating and/or 
boolean), Issuer 118 is capable of issuing instructions having operands 
stored in different register files. For example, an ADD instruction may 
have a first operand from the floating point register file and a second 
operand from the integer register file. Instructions with operands from 
different register files are typically given higher issue priority (i.e., they are 
issued first). This issuing technique conserves processor execution time 
and functional unit resources. 
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In a further embodiment in which EEU 100 may include two 
ALU's, ALU scanning becomes a bit more complicated. For speed 
reasons, one ALU scanner block scans from 10 to 17, while the other 
scanner block scans from 17 to 10. This is how two ALU requests are 
selected. With this scheme it is possible that an ALU instruction in 
bucket 1 will get issued before an ALU instruction in bucket 0, while 
increasing scanning efficiency. 

Scanner outputs 1003 are selected by MUXing logic 1004. A set 
of SELect inputs 1006 for MUX 1004 receive three 8-bit vectors (one for 
each operand) from TAL 122 via bus 120. The vectors indicate which of 
the eight instructions have no dependencies and are ready to be issued. 
Issuer 118 must wait for this information before it can start to issue any 
instructions. Issuer 118 monitors these vectors and when all three go high 
for a particular instruction, Issuer 118 knows that the inputs for that 
instruction are ready. Once the necessary functional unit is ready, the 
issuer can issue that instruction and send select signals to the register file 
port MUXes to pass the corresponding instructions outputs to register file 
117. 

In a preferred embodiment of the present invention, after Issuer 
118 is done it provides two 8-bit vectors per register file back to RRC 112 
via MUXOUTputs 1008 to bus 121. These vectors indicate which 
instructions are issued this cycle, are used a select lines for RPMs 124. 

The maximum number of instructions that can be issued 
simultaneously for each register file is restricted by the number of register 
file read ports available. A data dependency with a previous uncompleted 
instruction may prevent an instruction from being issued. In addition, an 
instruction may be prevented from being issued if the necessary functional 
unit is allocated to another instruction. 

Several instructions, such as load immediate instructions, Boolean 
operations and relative conditional branches, may be issued, 
independently, because they may not require resources other than register 
file read ports or they may potentially have no dependencies. 
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The last section of RRC 112 is the register file port MUX (RPM) 
section 124. Hie function of RPMs 124 is to provide a way for Issuer 118 
to get data out of register files 117 for each instruction to use. RPMs 124 
receive tag information via bus 126, and the select lines for RPMs 124 
5 come from Issuer 118 via a bus 121 and also from the computer's EEU 
control logic. The selected TAGs comprise read addresses that are sent 
to a predetermined set of ports 119 of register file 117 using bus 128. 

The number and design of RPMs 124 depend on the number of 
register files and the number of ports on each register file. One 

10 embodiment of RPMs 124 is shown in FIG. 4. In this embodiment, 
RPMs 124 comprises 3 register port file MUXes 402, 404 and 406. MUX 
402 receives as inputs the TAGs of instructions 0-7 corresponding to the 
source register field SI that are generated by TAL 122. MUX 404 
receives as inputs the TAGs of instructions 0-7 corresponding to the 

15 source register field S2 that are generated by TAL 122. MUX 406 
receives as inputs the TAGs of instructions 0-7 corresponding to the 
source/destination register field S/D that are generated by TAL 122. The 
outputs of MUXes 402, 404 and 406 are connected to the read addresses 
ports of register file 117 via bus 128. 

20 RRC 112 and Issuer 118 allow the processor to execute instructions 

simultaneously and out of program order. An IEU for use with the 
present invention is disclosed in commonly owned, co-pending application 
Serial No. 07/817,810 (Attorney Docket No. SP015/1397.0280001), the 
disclosure of which is incorporated herein by reference. 

25 While various embodiments of the present invention have been 

described above, it should be understood that they have been presented 
by way of example, and not limitation. Thus the breadth and scope of 
the present invention should not be limited by any of the above-described 
exemplary embodiments, but should be defined only in accordance with 

30 the following claims and their equivalents. 
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CLAIMS 

What is claimed is: 

1 1. A register renaming system for out-of-order execution of a set of 

2 reduced instruction set computer instructions having addressable 

3 source and destination register fields, adapted for use in a 

4 computer having an instruction execution unit with a register file 

5 accessed by read address ports and for storing instruction 

6 operands, the system comprising: 

7 (a) data dependance check means for deterniining data 

8 dependencies between the instructions; 

9 (b) tag assignment means for generating one of more 
10 tags to specify the location of operands, based on said data 
H dependencies determined by said data dependance check means; 

12 and 

13 (c) register file port means for selecting said tags 

1 4 generated by said tag assignment means and passing said tags onto 

15 the read address ports of the register file for storing execution 

16 results. 

1 2. The system of claim 1, wherein said data dependance check means 

2 determines said data dependencies by comparing the addresses of 

3 the source register field of each instruction to the addresses of the 

4 destination register fields. 

1 3. The system of claim 1, further comprising temporary storage means 

2 for temporarily storing out-of-order execution results, wherein said 

3 out-of-order execution results are passed to the register files in 

4 order after execution of the set of instructions is completed. 
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1 4. A register renaming method for performing out-of-order execution 

2 of a set of reduced instruction set computer instructions having 

3 addressable source and destination register fields, adapted for use 

4 in a computer having an instruction execution unit with a register 

5 file accessed by read address ports and for storing instruction 

6 operands, the method comprising the steps of: 

7 (1) detennining data dependencies between the 

8 instructions; 

9 (2) performing at least one of in-order and out-of-order 

10 issuing of two or more of the instructions in the instruction 

11 execution unit; 

12 (3) storing, temporarily, any out-of-order results in 

13 temporary storage means; 

14 (4) generating one or more tags to specify the location 

1 5 of said out-of-order results based on said data dependencies; 

16 (5) selecting appropriate ones of said tags corresponding 

17 to the issued instruction; and 

18 (6) passing said selected tags onto the read address ports 

19 of the register file for one of: 

20 (1) accessing said out-of-order results; and 

21 (ii) storing execution results. 

1 5. The method of claim 4, wherein said determining step further 

2 comprises the step of comparing the addresses of the source 

3 register field of each instruction to the addresses of the destination 

4 register fields. 

1 6. The method of claim 4, further comprising the step of storing 

2 in-order results in the register file and the temporary storage 

3 means. 
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1 7. The method of claim 4, farther comprising the step of passing said 

2 out-of-order results to the register files in-order after execution of 

3 the set of instructions is completed. 

1 8. A method for issuing instructions in a superscalar reduced 

2 instruction set computer haying an instruction issuer system 

3 capable of issuing a plurality of instructions in a single cycle, the 

4 system having more than one register file sets, the method 

5 comprising the steps of: 

6 prioritizing instructions to be issued according to the 

7 number of different register file sets furnishing operands to the 

8 instruction; and 

9 issuing instructions according to said prioritizing step. 

1 9. The method according to claim 8, wherein said prioritizing step 

2 assigns a higher priority to those instructions having operands 

3 furnished from the greater number of register file sets. 

1 10. An instruction issuer system in a superscalar reduced instruction 

2 set computer, the system capable of issuing a plurality of 

3 instructions in a single cycle, the system having more than one set 

4 of register files, the system comprising: 

5 first means for prioritizing instructions to be issued 

6 according to the number of different register files furnishing 

7 operands to the instruction; and 

8 second means, responsive to said first means, for 

9 issuing instructions according to said priority. 

1 11. The system according to claim 10, wherein said first means further 

2 assigns a higher priority to those instructions having operands 

3 furnished from the greater number of register files. 
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SUPERSCALAR RISC INSTRUCTION SCHEDULING 



ABSTRACT 



A register renaming system for out-of-order execution of a set of 
reduced instruction set computer instructions having addressable source 
and destination register fields, adapted for use in a computer having an 
instruction execution unit with a register file accessed by read address 
ports and for storing instruction operands. A data dependance check 
circuit is included for determining data dependencies between the 
instructions. A tag assignment circuit generates one of more tags to 
specify the location of operands, based on the data dependencies 
determined by the data dependance check circuit. A set of register file 
port multiplexers select the tags generated by the tag assignment circuit 
and pass the tags onto the read address ports of the register file for 
storing execution results. 
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